Abstract


In this paper we survey a number of methods for the
detection of abrupt changes (such as failures) in stochastic
dynamical systems. We concentrate on the class of linear
systems, but the basic concepts, if not the detailed analyses,
carry over to other classes of systems. The methods
surveyed range from the design of specific failure-sensitive
filters, to the use of statistical tests on filter innovations,
to the development of jump process formulations. Tradeoffs
in complexity versus performance are discussed.

I. Introduction


With the increasing availability and decreasing cost of digital hardware
and software, there has developed a desire in several disciplines for
the development of sophisticated digital system design techniques that can
greatly improve overall system performance. A good example of this can be
found in the field of digital aircraft control (see, for example, Doolin
[45], Taylor [46], and Meyer and Cicolani [47]), where a great deal of
effort is being put into the design of aircraft with reduced static stability,
flexible wings, etc. Such vehicles can provide improved performance
in terms of drag reduction and decreased fuel consumption, but they also
require sophisticated control systems to deal with problems such as active
control of unstable aircraft, suppression of flutter, the detection of
system failures, and management of system redundancy. The demands on such
a control system are beyond the capabilities of conventional aircraft
control system design techniques, and the use of digital techniques is
essential.

Another example can be found in the field of electrocardiography.
In recent years a great deal of effort has been devoted to the development
of digital techniques for the automatic diagnosis of electrocardiograms
(ECG's; see, for example, [47]). Such systems can be used for preliminary
screening of large numbers of ECG's, for the monitoring of patients in a
hospital, etc.


In this paper we review some of the recent work in one area of system
theory that is of importance in both of these examples, as well as in
many other system design problems. Specifically, we will discuss the problem
of the detection of abrupt changes in dynamical systems. In the
aircraft control problem one is concerned with the detection of actuator
and sensor failures, while in the ECG analysis problem one wants to detect
arrhythmias, i.e. sudden changes in the rhythm of the heart. For the sake of
simplicity in our discussion, we will refer to all such abrupt changes as
"failures", although, as in the ECG example, the abrupt change need not
be a physical failure. Our aim in this survey is to provide an overview
of a number of the basic concepts in failure detection. The problem of
system reorganization subsequent to the detection of a failure is considered
in several of the references. We will point out these references in
the sequel, but we will concentrate primarily on the detection problem.

The design of failure detection systems involves the consideration
of several issues. One is usually interested in designing a system that
will respond rapidly when a failure occurs; however, in high performance
systems one often cannot tolerate significant degradation in performance
during normal system operation. These two considerations are usually in
conflict. That is, a system that is designed to respond quickly to certain
abrupt changes must necessarily be sensitive to certain high frequency
effects, and this in turn will tend to increase the sensitivity of the
system to noise (via the occurrence of false alarms signaled by the
failure detection system). The tradeoff between these design issues is
best studied in the context of a specific example in which the costs of
the various tradeoffs can be assessed. For example, one might be more
willing to tolerate false alarms in a highly redundant system configuration
than in a system without substantial back-up capabilities.

In general, one would like to design a failure detection system that
takes system redundancy into account. For example, in a system containing
several back-up subsystems we may be able to devise a simple detection
algorithm that is easily implemented but yields only moderate false alarm
rates. On the other hand, by implementing a more complex failure detection
algorithm that takes careful account of system dynamics, one may be able
to reduce requirements for costly hardware redundancy.

In addition to taking hardware issues into consideration, the designer
of failure detection systems should consider the issue of computational
complexity. One clearly needs a scheme that has reasonable storage and time
requirements. It would also be useful to have a design methodology that
admits a range of implementations, allowing a tradeoff study of system
complexity vs. performance. In addition, it would be desirable to have a
design that takes advantage of new computer capabilities and structures
(e.g. designs that are amenable to modular or parallel implementations).

In this paper we survey a variety of failure detection methods, and,
keeping the issues mentioned above in mind, we will comment on the
characteristics, advantages, disadvantages, and tradeoffs involved in the
various techniques. In order to provide this survey with some organization
and to point out some of the key concepts in failure detection system design,
we have defined several categories of failure detection systems and have
placed the designs we have collected into these groups. Clearly such a
grouping can only be a rough approximation, and we caution the reader
against drawing too much of an inference about individual designs based
on our classification of them (several of the techniques could easily
fall into a number of our classes). In addition, for the sake of brevity
we have limited our detailed discussions to only a few of the many
techniques. Our choice of those techniques has been motivated by a desire
to span the range of available methods and by our familiarity with certain
of these algorithms. Finally, we have attempted to collect all of those
studies of the failure detection problem of which we are aware, and we
apologize for any oversights.


II. Formulations of the Failure Detection Problem 

In this paper we are mostly concerned with the analysis of linear
stochastic models in the standard state space form

System Dynamics

    x(k+1) = \Phi(k) x(k) + B(k) u(k) + w(k)                         (1)

Sensor Equation

    z(k) = H(k) x(k) + J(k) u(k) + v(k)                              (2)

where u is a known input, and w and v are zero-mean, independent, white
Gaussian sequences with covariances defined by

    E[w(k) w'(j)] = Q \delta_{kj},    E[v(k) v'(j)] = R \delta_{kj}  (3)

where \delta_{kj} is the Kronecker delta. We think of (1)-(3) as describing the
"normal operation" or "no failure" model of the system of interest. If no
failures occur, the optimal state estimator is given by the discrete Kalman
filter equations [33]


    \hat{x}(k+1|k) = \Phi(k) \hat{x}(k|k) + B(k) u(k)                (4)

    \hat{x}(k|k) = \hat{x}(k|k-1) + K(k) \gamma(k)                   (5)

    \gamma(k) = z(k) - H(k) \hat{x}(k|k-1) - J(k) u(k)               (6)

where \gamma is the zero-mean, Gaussian innovation process, and the gain K is
calculated from the equations

    P(k+1|k) = \Phi(k) P(k|k) \Phi'(k) + Q                           (7)

    V(k) = H(k) P(k|k-1) H'(k) + R                                   (8)

    K(k) = P(k|k-1) H'(k) V^{-1}(k)                                  (9)

    P(k|k) = P(k|k-1) - K(k) H(k) P(k|k-1)                           (10)

Here P(i|j) is the estimation error covariance of the estimate \hat{x}(i|j), and
V(k) is the covariance of \gamma(k). We refer to (4)-(10) as the "normal mode
filter" in the sequel.
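
As an illustration, one cycle of the normal mode filter (4)-(10) can be
sketched in Python as follows. This is a minimal sketch rather than a
definitive implementation: the function name normal_mode_step, the argument
order, the use of NumPy, and the numbers in the scalar example are our own
assumptions and do not come from the references.

```python
import numpy as np

def normal_mode_step(x_pred, P_pred, z, u, Phi, B, H, J, Q, R):
    """One cycle of the normal mode filter (4)-(10).

    x_pred and P_pred are x(k|k-1) and P(k|k-1). Returns x(k+1|k), P(k+1|k),
    the innovation gamma(k) and its covariance V(k).
    """
    gamma = z - H @ x_pred - J @ u            # innovation, eq. (6)
    V = H @ P_pred @ H.T + R                  # innovation covariance, eq. (8)
    K = P_pred @ H.T @ np.linalg.inv(V)       # filter gain, eq. (9)
    x_filt = x_pred + K @ gamma               # measurement update, eq. (5)
    P_filt = P_pred - K @ H @ P_pred          # covariance update, eq. (10)
    x_next = Phi @ x_filt + B @ u             # state prediction, eq. (4)
    P_next = Phi @ P_filt @ Phi.T + Q         # covariance prediction, eq. (7)
    return x_next, P_next, gamma, V

# Illustrative scalar example (all numbers are made up for this sketch).
one = np.array([[1.0]])
zero = np.array([[0.0]])
x_next, P_next, gamma, V = normal_mode_step(
    x_pred=np.array([0.0]), P_pred=one, z=np.array([2.0]), u=np.array([0.0]),
    Phi=one, B=zero, H=one, J=zero, Q=zero, R=one)
```

The innovation and its covariance are returned along with the prediction
because several of the methods surveyed operate on the filter innovations
rather than on the state estimate itself.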

In addition to the above estimator, one may also have a closed loop 
control law, such as the linear law 

    u(k) = G(k) \hat{x}(k|k)                                         (11)

We then obtain the normal operation configuration depicted in Figure 1. 


The problem of failure detection is concerned with the detection of
abrupt changes in a system, as modeled by (1)-(3). Such abrupt changes
can arise in a number of ways. For example, in aerospace applications,
one is often concerned with the failure of control actuators and surfaces.
Such abrupt changes can manifest themselves as shifts in the control gain
matrix B, increased process noise, or as a bias in equation (1) (as might
arise if a thruster developed a leak [31]). In addition, failures of
sensors may take the form of abrupt changes in H, increases in measurement
noise, or biases in (2). For simplicity, we will refer to abrupt
changes in (1) as "actuator failures," and shifts in (2) will be called
"sensor failures." Again we point out that in many applications shifts
in (1) or (2) may be used to model changes in observed system behavior
that have nothing to do with actuators or sensors.

The main task of a failure detection and compensation design is to
modify the normal mode configuration in order to include the capability
of detecting abrupt changes and compensating for them by activating back-up
systems, adjusting the feedback design appropriately, etc. Conceptually,
we think of the detection-compensation system as part of the filtering
portion of the feedback loop. As illustrated in Figures 2 and 3, the
resulting filter design can take one of two forms. Either we perform a
complete redesign of the filter, replacing (4)-(10) with a filter that is
sensitive to failures, or we design a system that monitors the normal
system configuration and adjusts the system accordingly. We will discuss
examples of both of these structures.

As mentioned earlier, we will concentrate primarily on the problem
of failure detection, which we consider to consist of three tasks: alarm,
isolation, and estimation. The alarm task simply consists of making a
binary decision, either that something has gone wrong or that everything
is fine. The problem of isolation is that of determining the source of
the failure, e.g., which sensor or actuator has failed, what type of
arrhythmia has occurred, etc. Finally, the estimation problem involves
the determination of the extent of failure. For example, a sensor may
become completely non-operational (an "off" or "hard-over" failure), or
it may simply suffer degradation in the form of a bias or increased
inaccuracies, which may be modeled as an increase in the sensor noise
covariance. In the latter case, estimates of the bias or the increase in
noise may allow continued use of the sensor, albeit in a degraded mode.
Clearly the extent to which we need to perform these various tasks depends
upon the application. If a human operator is available, we may only be
interested in generating an alarm that tells him to perform further tests.
In other systems in which back-ups are available, we might settle for
failure isolation without estimation. On the other hand, in the absence
of hardware redundancy, we may be interested in using a degraded instrument
and thus would need estimation information.

Intuitively we can associate increased software system complexity with
the tasks, i.e., isolation requires more sophisticated data processing
than an alarm, and estimation more than isolation. On the other side, as
we increase failure detection capabilities, we may be able to decrease
hardware redundancy. Also, in some applications we may be able to delay
isolation and estimation until after an alarm has been sounded. In such
a sequential structure, one increases detector complexity after a failure
has been detected, thereby reducing the computational burden during normal
operation. Again the details of such considerations depend upon the
particular application.

Another tradeoff involving failure detection system complexity
involves its relation to detection system performance. For example, one
might expect that one could achieve better alarm performance by using a
priori knowledge concerning likely failure modes. That is, by looking for
specific forms of system behavior that are characteristic of certain failures,
one should be able to improve detection performance. Thus, it seems likely
that alarm performance (as measured by the tradeoff between false alarms
and missed detections) will be improved if we attempt simultaneous detection,
isolation, and estimation of failures. This tradeoff of complexity vs.
performance is extremely important in the design of failure detection
systems.

In the following sections we will discuss several failure detection
methods and will comment on their characteristics with respect to the
issues mentioned in this and the preceding section. We have not provided
a general set of failure models to be considered, as the various techniques
are based on quite different failure models. These will be described as
we discuss the various methodologies.


III. "Failure-Sensitive" Filters

Our first class of failure detection concepts is aimed at overcoming
the problem of an "oblivious filter". As has been noted by many authors
[1]-[3], [33], the optimal filter defined by (4)-(10) performs well if
there are no modelling errors; however, it is possible for the filter
estimate to diverge if there are substantial unmodeled phenomena. The
problem occurs because the filter "learns the state too well", i.e. the
precomputed error covariance P and filter gain K become small, and the
filter relies on old measurements for its estimates and is oblivious to
new measurements. Thus, if an abrupt change occurs, the filter will
respond quite sluggishly, yielding poor performance. Consequently, one
would like to devise filter designs that remain sensitive to new data so
that abrupt changes will be reflected in the filter behavior.

Two well-known techniques for keeping the filter sensitive to new
data are the exponentially age-weighted filter studied by Fagin [1] and
Tarn and Zaborszky [2] and the limited memory filter proposed by
Jazwinski [3]. Others, such as increasing noise covariances or simply
fixing the filter gain, are discussed by Jazwinski in [33]. These techniques
yield only indirect failure information. That is, if an abrupt
change occurs, these filters will respond faster than the normal filter,
and one can base a failure detection decision on sudden changes in the
estimate \hat{x}.
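
The "oblivious filter" effect and the effect of age-weighting can be seen
in a small scalar sketch. Here we assume a constant state (\Phi = 1, Q = 0)
and use one common fading-memory variant in which the predicted covariance
in (7) is inflated by a factor alpha >= 1 at every step. This particular
form, the function name gain_history, and the numerical values are our own
assumptions for illustration and are not necessarily the formulations used
in [1]-[3].

```python
def gain_history(alpha, steps=200, p0=1.0, r=1.0):
    """Gain sequence of a scalar filter for a constant state (Phi = 1, Q = 0).

    alpha = 1 gives the normal mode filter; alpha > 1 discounts old data by
    inflating the predicted error covariance at every step.
    """
    p_pred = p0
    gains = []
    for _ in range(steps):
        k = p_pred / (p_pred + r)      # gain, as in (8)-(9) with H = 1
        p_filt = (1.0 - k) * p_pred    # covariance update, as in (10)
        p_pred = alpha * p_filt        # time update (7), inflated by alpha
        gains.append(k)
    return gains

normal = gain_history(alpha=1.0)   # gain decays toward zero (oblivious filter)
faded = gain_history(alpha=1.2)    # gain settles at a nonzero value
```

With these values the normal filter gain after k steps is 1/(k+1), so new
measurements are weighted less and less, while the age-weighted gain
converges to (alpha-1)/alpha and the filter continues to respond to new
data, at the price of a noisier estimate when no failure has occurred.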

It is important to note a performance tradeoff evident in this method.
As we increase our sensitivity to new data (by effectively increasing the
bandwidth of the Kalman filter), our system becomes more sensitive to sensor
noise, and the performance of the filter under no-failure conditions
degrades. In some cases this can be rather severe, and one may not be
able to tolerate the degradation in overall system performance under
no-failure conditions. One might then consider a two filter system: the
normal mode filter (4)-(10) as the primary filter, with this type of
failure-sensitive filter as an auxiliary monitor, used only to detect abrupt
changes. We remark that the tradeoff between detection performance and
filter behavior under normal conditions is a characteristic of all failure
detection systems and is analogous to the costs associated with false alarms
and missed detections in standard detection problems [41].

The techniques mentioned so far in this section are rather indirect
failure detection approaches. Several methods have been developed for
the design of filters that are sensitive to specific failures. One method
involves the inclusion of several "failure states" in the dynamic model
(1)-(3). Kerr [25] has considered a procedure in which failure modes,
such as the onset of biases, are included as state variables. If the
estimates of these variables vary markedly from their nominal values, a
failure is declared. A two-confidence interval overlap decision rule for
failure detection using such failure states is described and its performance
is analyzed in [25]. Note that this approach provides failure isolation
and estimation at the expense of increased dimensionality and some performance
degradation under no-failure conditions (inclusion of the added states
effectively opens up the bandwidth of the Kalman filter).

An alternative to the addition of failure states to the dynamic model
is the class of detector filters developed by Beard [4] and Jones [5].
Their work has led to a systematic design procedure for the detection of a
wide variety of abrupt changes in linear time-invariant systems. They
consider the continuous-time, time-invariant, deterministic system model

    \dot{x}(t) = A x(t) + B u(t)                                     (11)

    z(t) = C x(t)                                                    (12)

and design a filter of the form

    \dot{\hat{x}}(t) = A \hat{x}(t) + D (z(t) - C \hat{x}(t)) + B u(t)   (13)

The primary criterion in the choice of the gain matrix D is not that (13)
provide a good estimate of x (as it is with observers or optimal estimators),
but rather that the effects of certain failures are accentuated in the
filter residual

    \gamma(t) = z(t) - C \hat{x}(t)                                  (14)

The basic idea is to choose D so that particular failure modes manifest
themselves as residuals which remain in a fixed direction or in a fixed
plane.

To illustrate the Beard-Jones approach, let us consider a simple example
from [4]. Suppose we wish to detect a failure of the ith actuator (i.e.
in the actuator driven by the ith component of u). If we assume the failure
takes the form of a constant bias, our state equation becomes

    \dot{x}(t) = A x(t) + B [u(t) + \nu e_i]
               = A x(t) + B u(t) + \nu b_i,        t \ge t_0         (15)

where e_i is the ith standard basis vector, b_i is the ith column of B, and
t_0 is the (unknown) time of failure. Suppose we consider the case of full
state measurement, i.e., let C = I. In this case we obtain a differential
equation for the residual

    \dot{\gamma}(t) = [A - D] \gamma(t) + \nu b_i                    (16)

If we choose D = \sigma I + A, we obtain

    \dot{\gamma}(t) = -\sigma \gamma(t) + \nu b_i

    \gamma(t) = e^{-\sigma (t - t_0)} \gamma(t_0)
                + \frac{\nu (1 - e^{-\sigma (t - t_0)})}{\sigma} b_i (17)

Thus, as the effect of the initial condition dies out, \gamma(t) maintains a
fixed direction (b_i) with magnitude proportional to the failure size (\nu).
Note that as we increase \sigma (thus increasing filter gain), the initial
condition dies out faster, but the magnitude of the steady-state value of \gamma
decreases. Thus, if there is any noise in the system, we cannot make \sigma
arbitrarily large.
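
The behavior of the residual in this example can be checked with a short
simulation. The following is a minimal sketch, assuming a hypothetical
two-state plant with two actuators, full state measurement (C = I), the
gain choice D = sigma*I + A, and simple Euler integration; the matrices, the
input, and all numerical values are made up for illustration and are not
taken from [4] or [5].

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # hypothetical stable plant
B = np.array([[0.0, 1.0], [1.0, 0.5]])     # columns are b_1 and b_2
sigma, nu, t0 = 5.0, 0.7, 1.0              # filter parameter, bias size, failure time
i = 1                                      # failed actuator (0-based index)
D = sigma * np.eye(2) + A                  # detector gain, with C = I

dt, t_end = 1e-3, 6.0
x = np.array([1.0, 0.0])                   # true state
x_hat = np.zeros(2)                        # detector filter state
for n in range(int(round(t_end / dt))):
    t = n * dt
    u = np.array([np.sin(t), 1.0])         # known input
    bias = nu * B[:, i] if t >= t0 else np.zeros(2)
    gamma = x - x_hat                      # residual (14) with C = I
    x_dot = A @ x + B @ u + bias           # plant with actuator bias, as in (15)
    x_hat_dot = A @ x_hat + D @ gamma + B @ u   # detector filter (13)
    x, x_hat = x + dt * x_dot, x_hat + dt * x_hat_dot

gamma = x - x_hat                          # residual well after the failure
direction = gamma / np.linalg.norm(gamma)  # should line up with b_i
```

Well after the failure the residual settles at (nu/sigma) b_i, so its
direction identifies the failed actuator and its length gives the bias
size, while a larger sigma shrinks the steady-state residual as noted above.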

In their work Beard and Jones consider the design of such filters
for an extremely wide variety of failure modes, including actuator and
sensor shifts and shifts in A and B. The initial deterministic analysis
for all of these cases was considered by Beard [4], while a systematic
design procedure is given by Jones [5] for the design of the gain D to
allow detection of several failure modes. Jones' approach is quite
geometric in nature, and his formulation allows one to gain considerable
insight into the detection problem. As pointed out in [5], the gain
selection problem is quite similar to the output decoupling problem and
requires the introduction of the important concept of "mutually detectable
failure modes" in order to answer the question of whether or not one can
simultaneously distinguish between several types of failures. Thus the
question of failure isolation is of central importance in the design
methodology derived in [5].

The results in [4], [5] represent perhaps the most thorough study of
the basic concepts underlying failure detection. The tradeoff between
detection and filter performance is discussed in depth in [5] and an
attempt is made in [4] to introduce the concept of the level of redundancy
in a dynamical system.

As mentioned in the example, the basic design procedure is deterministic.
However, in this simple example we can see how one can take
noise into account. If the system (11), (12) contains noise, we have seen
that one may not wish to make the scalar \sigma as large as possible. In fact,
one could choose \sigma so as to minimize the mean-square estimation error in
the detector filter when there is no failure. In his thesis [5], Jones
describes a procedure in which one first chooses the structure of D for
failure detection purposes and then chooses the remaining free parameters
in order to minimize the estimation error covariance. Although this yields
a suboptimal filter design, it may work quite well, as it did in the
problem reported in [5].

In summary, the Jones-Beard design methodology is extremely useful
conceptually, can be used to detect a wide variety of failures, and provides
detailed failure isolation information. It is suboptimal as an estimator,
and if this presents a serious problem, one might wish to use the detector
filter as an auxiliary monitoring system. This appears to be only a minor
drawback, and the major limitation of the approach is its applicability only
to time-invariant systems.

IV. Voting Systems 

Voting techniques are often useful in systems that possess a high degree
of parallel hardware redundancy. Memoryless voting methods can work quite
well for the detection of "hard" or large failures, and the papers of
Gilmore and McKern [6], Pejsa [7], and Ephgrave [8] discuss the successful
application of voting techniques to the detection of hard gyro failures in
inertial navigation systems.

In standard voting schemes, one has (at least) three identical instruments.
Simple logic is then developed to detect failures and eliminate
faulty instruments; for example, if one of the three redundant signals
differs markedly from the other two, the differing signal is eliminated.
Recently, Broen [9] has developed a class of voter-estimators that possesses
advantages relative to standard voting techniques. Consider the dynamical
system

    x(k+1) = \Phi x(k)                                               (18)

with a triply redundant set of sensors

    y_1(k) = H_1 x(k) + v_1(k)

    y_2(k) = H_2 x(k) + v_2(k)                                       (19)

    y_3(k) = H_3 x(k) + v_3(k)

Broen develops a set of recursive filter equations for computing the
estimate \hat{x}(k) that minimizes

    \sum_{i=0}^{k} \sum_{j=1}^{3} w_{ji} \gamma_j'(i) R_j^{-1} \gamma_j(i)   (20)

where R_j is the covariance of the measurement noise v_j, and \gamma_j is the
innovations sequence

    \gamma_j(i) = y_j(i) - H_j \Phi^{i-k} \hat{x}(k)                 (21)

Here w_{ji} is a function of y_1(i), y_2(i), y_3(i) which is large if y_j(i)
is close to the other two y_m(i) and is small if y_j(i) deviates greatly
from the other two. In this way, one obtains a "soft" voting procedure
in which faulty sensors are smoothly removed from consideration. This
greatly alleviates the cost of false alarms, but the price is the on-line
computation of the filter gain (which is a function of the w_{ji}). Note that
in equation (19), Broen appears to allow the y_j to be physically different
sensors (different H_j's), but the analysis of his paper makes it clear that
he requires identical sensors, i.e. H_1 = H_2 = H_3.

Voting schemes are in general relatively easy to implement and usually
provide fast detection of hard failures, but they are only applicable in
systems possessing a high level of parallel redundancy. They do not in
general take advantage of redundant information provided by unlike sensors,
and thus cannot detect failures in single or even doubly redundant sensors.
In addition, voting techniques can have difficulties in detecting "soft"
failures (such as a small bias shift).
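
A standard memoryless voter of the kind described above can be sketched in
a few lines. This is only an illustration of the basic idea for three like
scalar instruments: the function name vote, the single fixed threshold, and
the choice to average the surviving signals are our own assumptions and are
not the logic of [6]-[9].

```python
def vote(y1, y2, y3, threshold):
    """Memoryless voting among three like scalar sensors.

    Returns (estimate, rejected), where rejected is the 0-based index of the
    sensor voted out, or None if no sensor is rejected.
    """
    readings = [y1, y2, y3]
    for j in range(3):
        a, b = [readings[m] for m in range(3) if m != j]
        others_agree = abs(a - b) <= threshold
        j_differs = (abs(readings[j] - a) > threshold
                     and abs(readings[j] - b) > threshold)
        if others_agree and j_differs:
            return (a + b) / 2.0, j        # average the two agreeing sensors
    return sum(readings) / 3.0, None       # no clear outlier: use all three

hard = vote(1.0, 1.1, 5.0, threshold=0.5)  # third sensor has a hard failure
soft = vote(1.0, 1.1, 1.4, threshold=0.5)  # small bias on the third sensor
```

In the first call the third sensor is voted out at once, while in the
second the small bias stays below the threshold and passes undetected,
which is the difficulty with soft failures noted above. Lowering the
threshold would catch the bias but would also raise the false alarm rate.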



V. Multiple Hypothesis Filter-Detectors

A rather large class of adaptive estimation and failure detection
schemes involves the use of a "bank" of linear filters based on different
hypotheses concerning the underlying system behavior. In the work of
Athans and Willner [10] and Lainiotis [11], several different sets of
system matrices are hypothesized. Filters for each of the models are constructed,
and the innovations from the various filters are used to compute
the conditional probability that each system model is the correct one.
In this manner, one can do simultaneous system identification and state
estimation. In addition, an abrupt change in the probabilities can be
used to detect changes in true system behavior. This technique has been
investigated in the context of the adaptive control of the F-8C digital
fly-by-wire aircraft by Athans, Dunn, Greene, et al. [35] and also has
been applied to the problem of classifying rhythms and detecting rhythm
shifts in electrocardiograms. Extremely good results in the latter case
are reported by Gustafson, Willsky, and Wang in [36].
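
The probability computation at the heart of such a filter bank can be
sketched as a Bayes' rule update in which each filter's innovation is scored
against its own predicted innovation covariance. The sketch below assumes
Gaussian innovations and a fixed set of hypotheses; the function names and
the two-model example are hypothetical, and the recursions actually used in
[10], [11] may differ in detail.

```python
import numpy as np

def gaussian_likelihood(gamma, V):
    """Density N(0, V) evaluated at the innovation gamma."""
    gamma = np.atleast_1d(gamma).astype(float)
    V = np.atleast_2d(V).astype(float)
    m = gamma.size
    quad = gamma @ np.linalg.solve(V, gamma)
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** m * np.linalg.det(V))

def update_probabilities(prior, innovations, covariances):
    """Bayes update of the model probabilities from one set of innovations."""
    like = np.array([gaussian_likelihood(g, V)
                     for g, V in zip(innovations, covariances)])
    post = np.asarray(prior, dtype=float) * like
    return post / post.sum()

# Two hypothesized models: filter 0 sees a small innovation, filter 1 a large one.
post = update_probabilities(prior=[0.5, 0.5],
                            innovations=[0.0, 3.0],
                            covariances=[1.0, 1.0])
```

Here the probability mass moves toward the model whose filter explains the
measurement best, and a sudden redistribution of this mass over time is the
kind of signature that can be used to flag a change in system behavior.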

Techniques involving multiple hypotheses have also been used to
design failure detection systems. Montgomery, Caglayan, and Price [12],
[13] have used such a technique for digital flight control systems and have
studied its robustness in the presence of nonlinearities via simulations.
Recently a technique involving a bank of observers has been devised [34],
and a successful application to a hydrofoil sensor failure problem is
reported by Clark, Fosth, and Walton in [34]. Also, Willsky, Deyst, and
Crawford [15], [16] have applied the methodology devised by Buxbaum and
Haddad in [14] to study failure detection for an inertial navigation
problem. We will briefly describe this technique to illustrate some of
the concepts underlying the bank of filters approach. We also refer the
reader to Wernersson [42] for a technique that is similar to that discussed
in [16].

Consider the system

    x(k+1) = \Phi(k) x(k) + w(k)                                     (22)

    z(k) = H(k) x(k) + v(k)                                          (23)

We are interested in detecting sudden shifts in certain of the components
of x (e.g., bias states). We model these shifts by choosing the distribution
of w appropriately. Let {f_1, ..., f_r} be the set of hypothesized failure
directions. We then assume that w has a high probability of being the usual
process noise and a small probability of including a burst of noise in each
of the failure directions. Thus the density for w(k) is

    p_0 N(0, Q) + \sum_{i=1}^{r} p_i N(0, Q + c_i f_i f_i')          (24)

    \sum_{i=0}^{r} p_i = 1,    p_i \ll p_0,    i = 1, ..., r         (25)

Here N(m, P) is a normal density with mean m and covariance P.

If we hypothesize such a density at each point in time and if we
assume that x(0) is normally distributed, we have the following expression
for the conditional density of x(k) given z(1), ..., z(k):

    p(x,k) = \sum_{i_1=0}^{r} \cdots \sum_{i_k=0}^{r} p_i N(m_i, P_i)    (26)

Here i = (i_1, ..., i_k), and the density has the following interpretation.
Let j = (j_1, ..., j_k) be a random k-tuple where j_s = i if there is a shift
in the f_i direction at time s (i=0 is used to denote no shift). Then

    p_i = Pr(j = i | z(1), ..., z(k))                                (27)

and m_i and P_i are the mean and covariance of the Kalman filter designed
assuming j = i (i.e. assuming w(s) has covariance Q + c_{i_s} f_{i_s} f_{i_s}').
The p_i can be computed in a sequential manner as a function of the various
filter innovations. We refer the reader to [14]-[16] for the details of the
calculations.

Note that the implementation of (26) requires an exponentially growing
bank of filters (there are (r+1)^k terms in (26)). To avoid this problem
a number of approximation techniques have been proposed [14]-[16]. The one
used in [16] involves hypothesizing shifts only once every N steps. At the
end of each N step period we "fuse" the (r+1) densities into a single density
and begin the procedure again. In this way we implement only (r+1)
filters at any time. We note that the techniques devised in [10]-[12] do not
involve growing banks of filters (as the number of hypothesized models does
not grow in time). However, it is possible for all of the filters in the
bank to become oblivious, and thus shifts between the hypotheses may go
undetected (see [16], [36] for examples). The technique of periodic fusing
of the densities and initiation of a new bank effectively avoids this problem
(as would designing the original bank using age-weighted filtering techniques).
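
One simple way to "fuse" a weighted set of Gaussian densities into a single
Gaussian is to match the mean and covariance of the mixture. The sketch
below shows this moment-matching step under the assumption that it is an
acceptable stand-in for the fusing operation; it is not necessarily the rule
used in [16], and the function name fuse and the numbers are hypothetical.

```python
import numpy as np

def fuse(weights, means, covs):
    """Collapse a Gaussian mixture to one Gaussian with the same mean and covariance.

    weights: (n_hyp,) probabilities, means: (n_hyp, n), covs: (n_hyp, n, n).
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                  # normalize the probabilities
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    mean = w @ means                                 # mixture mean
    dev = means - mean                               # spread of the component means
    spread = np.einsum('i,ij,ik->jk', w, dev, dev)
    cov = np.einsum('i,ijk->jk', w, covs) + spread   # mixture covariance
    return mean, cov

# Two scalar hypotheses: "no shift" (probability 0.9) and "shift" (probability 0.1).
mean, cov = fuse(weights=[0.9, 0.1],
                 means=[[0.0], [2.0]],
                 covs=[[[1.0]], [[1.0]]])
```

The fused covariance exceeds the individual filter covariances because it
also carries the disagreement between the hypotheses, which is one way of
seeing why a bank restarted from the fused density does not become
oblivious.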





The technique described above was applied to the problem of detecting
gyro and accelerometer bias shifts in a time-varying inertial calibration
and alignment system. The results of these tests are extremely impressive.
This is not surprising, as the multiple hypothesis method computes precisely
the quantities of interest: the probabilities of all types of failures
under consideration. The cost associated with such a high level of performance
is an extremely complex failure detection system. Note, however,
that the parallel structure of the system allows one to consider highly
efficient parallel processing computer implementations. In addition, the
use of reduced-order filters for the various failure hypotheses may increase
the practicality of such a scheme, or one might consider the use of a
simpler detection-only system to detect failures, with a switch to a multiple
hypothesis procedure for failure isolation and estimation after a failure
has been detected.

However, even if such a failure detection scheme cannot be implemented
in a particular application, it provides a useful benchmark for comparison
with simpler techniques. In addition, by studying the simulation of a
multiple hypothesis method, one can gain useful insight into the dynamics
of failure propagation and detection (see the discussion in [16]).

McGarty [23] has developed a method for rejecting bad measurements
that bears some similarity to the approach just discussed. Each measurement
has a binary random variable g(k) associated with it. If g(k)=1
the measurement is "good" (i.e. the measurement contains the signal of
interest), while g(k)=0 denotes a bad data point (the measurement is pure
noise). McGarty devises a maximum likelihood approach for estimating the
values of the exponentially growing set of possibilities (g(i) = 1 or 0,
i = 1, ..., k). He also allows these variables to have a sequential correlation
(i.e. knowing that the present measurement is good or bad says something
about the next observation). A computationally feasible approximation
method is devised and simulation results are described. We refer the reader
to [23] for details.

Recently, Athans, Whiting, and Gruber [51] have also considered the
problem of designing an estimator that can detect and remove bad or false
measurements. Their approach is Bayesian in nature, i.e. an estimate is
generated of the a posteriori probability that a given measurement is
false. The method of calculation of these pseudo-probabilities is quite
similar to that used in the other multiple hypothesis methods (see [10]-[14]).
The reader is referred to [51] for details of the analysis and for a discussion
of some successful simulation results.

VI. Jump Process Formulations 

The problem of the detection of abrupt changes in dynamical systems
suggests the use of jump process techniques in devising system design
methodologies (see [39], [49]-[50] for general results on jump processes). One
models potential failures as jumps, characterized by a priori distributions
which reflect initial information concerning failure rates. The sizes of
the possible failures are usually taken to be known. One could, however,
model failure magnitude as a random variable. This leads to a compound
jump process formulation which greatly complicates the desired analysis.

In any event, taking such a jump process formulation, one can devise
failure-sensitive control laws and methods for computing the conditional
probability of failure. Control problems of this type have received a
great deal of attention in the literature. Sworder and Robinson [17]-[20],
[37] and Ratner and Luenberger [21] have considered the design of control
laws which take into account the possibility of sudden shifts in system
matrices. The results they have obtained are for the full-state feedback
problem with no system randomness other than the jumping of the system
matrices among a finite set of possible matrices.

Davis [22] has utilized nonlinear estimation techniques to solve a
fault detection problem. His formulation is as follows: consider the
scalar stochastic equations

$dx(t) = a(t)\,x(t)\,dt + g(t)\,dv(t)$    (28)

$dy(t) = h(t)\,x(t)\,dt + dw(t)$    (29)

where w and v are independent Brownian motion processes and

$a(t) = a_0(t)\,[1-\xi(t)] + a_1(t)\,\xi(t)$    (30)

where

$\xi(t) = \begin{cases} 0, & t < T \\ 1, & t \ge T \end{cases}$    (31)

and T is a random variable. Here we interpret $a_0$ as the unfailed dynamics,
and $a_1$ represents the failure mode. Davis derives the optimal,
infinite-dimensional equations for the computation of the conditional mean
of x and the conditional probability

$\Pr[\,t \ge T \mid y(s),\ 0 \le s \le t\,]$    (32)

An implementable approximation is described in [22], but evaluation of its
performance has not as yet been made.

Note that Davis' method leads to an estimate of x that is suboptimal
under no-failure conditions. Chien [24] has devised a jump process
formulation that avoids this difficulty for the problem of the detection
of a jump or a ramp in a gyro bias. He considers the dynamical model

kit) •mit) + w(t) (33) 

where w is a white noise process. Three hypotheses are conjectured for the
form of the gyro output:

Normal Mode

$z(t) = x(t) + v(t), \quad \forall t$    (34)

Bias Mode

$z(t) = x(t) + m\,\xi(t) + v(t), \quad t > T$    (35)

Ramp Mode

$z(t) = x(t) + n\,(t-T)\,\xi(t) + v(t), \quad t > T$    (36)

where n and m are unknown constants, v is white noise, T is the time of failure,
and $\xi(t)$ is the unit step at time T, as in (31).

Chien's approach is as follows: design a filter based on $H_0$, the
normal-mode hypothesis (which will thus yield the optimal estimate for
t<T, assuming no false alarms occur), and determine the steady-state effect
of the bias and ramp degradations on the filter residuals. If one then
hypothesizes a failure rate q, i.e.

$P(T > t) = e^{-qt}$    (37)

and if one further assumes a nominal size for the bias m, one can then compute
an approximate stochastic differential equation for
$\Pr(H_1 \mid z(s),\ s \le t)$, the conditional probability of the bias
hypothesis, in which the input to this equation is the residual $\gamma$ of
the filter. The details of the analysis are described in [24].
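
To make the idea concrete, the following is a minimal discrete-time sketch of
a conditional failure probability driven by filter residuals. It is not
Chien's continuous-time equation from [24]; it is the standard Bayesian
recursion one obtains under assumptions introduced here purely for
illustration: the residuals are independent Gaussian samples with variance
`var`, zero mean before the failure and mean `shift` (the assumed steady-state
effect of a nominal bias) after it, and the failure time is exponentially
distributed with rate `rate`, as in (37). All function and parameter names
are hypothetical.

```python
import math


def failure_probability(residuals, shift, var, rate, dt=1.0, p0=0.0):
    """Return the posterior probability of failure after each residual.

    Assumed model (illustrative only): residual ~ N(0, var) before the
    failure and ~ N(shift, var) after it; failure time exponential with
    the given rate, sampled every dt time units.
    """
    # Probability that a failure occurs during one step, given none so far.
    hazard = 1.0 - math.exp(-rate * dt)
    p = p0
    history = []
    for r in residuals:
        # Prediction: a failure may have occurred since the last sample.
        p_pred = p + (1.0 - p) * hazard
        # Log-likelihood ratio of N(shift, var) against N(0, var) at r.
        log_lr = (shift * r - 0.5 * shift * shift) / var
        log_lr = max(-700.0, min(700.0, log_lr))  # keep exp() in range
        lr = math.exp(log_lr)
        # Bayes update of the failure probability.
        p = p_pred * lr / (p_pred * lr + (1.0 - p_pred))
        history.append(p)
    return history
```

Because the prior is propagated at every step, the computed probability never
locks onto the no-failure hypothesis, which is the sense in which a detector
of this type stays sensitive to new data.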

For his problem Chien is able to demonstrate that his detection
procedure, based on the assumption of a nominal value for the bias failure m,
has the capability of detecting biases larger than m and also can be used
to detect ramps (the ramp mode). Of course, the delay times until detection in
these cases are greater than if one implemented a filter based on the proper
bias size or if one were looking for a ramp (indicating the potential
usefulness of estimating the failure magnitude). The major advantages of
Chien's approach are the simplicity of the detector (implementation of a
scalar stochastic equation) and the fact that one obtains an estimate of
precisely the quantity of interest: the conditional probability of
failure. The simplicity of the scheme may, in fact, make it a great deal
more robust in the face of system modelling errors (such as the use of
an extremely simplified gyro error model) than more sophisticated approaches.
Also, this approach leads to no degradation in performance prior to
detection of the failure. In addition, the use of a probabilistic description
of the time of failure allows one to avoid the problem of the oblivious
filter, i.e. the fact that a failure can occur at any time has been
incorporated in the design, which therefore will remain sensitive to new
data.

The drawbacks of the scheme are the use of a fixed bias size and the
use of the steady-state effect of the failure on the filter residual. The
first of these may not be too much of a problem (as Chien has pointed out),
but the second may cause difficulties. Specifically, this limits the approach
to time-invariant systems and filters. In addition, as the transient effect
of the failure has been ignored, it may be difficult to make quick detections
of certain changes (i.e. we may have to wait until the transient dies out).
In the next section we will discuss an approach (the GLR method) which has
several concepts in common with Chien's approach and which allows one to
overcome these two drawbacks (at the cost of added computational complexity,
of course).

In summary, jump process formulations appear to be quite natural for
failure detection problems. One usually makes approximations in the
analysis in order to obtain implementable solutions. These simplifications
impose some limitations on the capabilities of the designs, but there is
at present no systematic analytical procedure for evaluating these
limitations or for studying tradeoffs between design complexity and system
performance.

VII. Innovations-Based Detection Systems

Chien's failure detection technique can also be placed in the class of
failure detection methods that involve the monitoring of the innovations of
a filter based on the hypothesis of normal system operation. In such a
configuration the overall system uses the normal filter until the innovations
monitoring system detects some form of aberrant behavior. The fact that the
monitoring system can be attached to a filter-controller feedback system
is particularly appealing, since overall system behavior is not disturbed
until after the monitor signals a failure and since the monitoring system
can be designed to be added to an existing system.

Mehra and Peschon [26] have suggested a number of possible statistical
tests to be performed on the innovations. One of these is a chi-squared
test which was applied in [15], [16] by Willsky, Deyst and Crawford. Let
$\gamma(k)$ be the p-dimensional innovations for the filter defined by (4)-(10).
If the system is operating normally, the innovations sequence is zero-mean
and white with known covariance V(k). In this case the quantity

$\ell(k) = \sum_{j=k-N+1}^{k} \gamma'(j)\,V^{-1}(j)\,\gamma(j)$    (38)

is a chi-squared random variable with Np degrees of freedom [26], [15], [16].
If a system abnormality occurs, the statistics of $\gamma$ change, and one can
consider a detection rule of the form

$\ell(k) \underset{\text{NO FAILURE}}{\overset{\text{FAILURE}}{\gtrless}} \epsilon$    (39)

With the aid of chi-squared tables, one can compute the probability $P_F$ of
false alarm as a function of the innovations window length N and the decision
threshold $\epsilon$. The probability $P_D$ of correct detection depends upon the
particular failure mode (see [16] and the discussion of the GLR approach
to follow). We note that for a given failure mode, as N increases the
probability of correct detection may decrease, i.e. by averaging a larger
number of residuals we smooth out the effect of a failure on $\gamma$, and the
detector may become somewhat oblivious (or at best respond quite
slowly) to new data. On the other hand, too small a value of N may yield
an unacceptably high value of $P_F$.
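
The test (38), (39) can be sketched as follows for scalar innovations (p = 1),
so that the window statistic has N degrees of freedom. This is an illustrative
sketch, not the implementation used in [15], [16]; the function names are
hypothetical, and the closed-form tail probability used here in place of
chi-squared tables holds only for an even number of degrees of freedom.

```python
import math
from collections import deque


def chi2_tail_even(threshold, dof):
    """P(X > threshold) for a chi-squared X with an even number of dof."""
    if dof <= 0 or dof % 2:
        raise ValueError("this closed form needs a positive even dof")
    half = threshold / 2.0
    term, total = 1.0, 0.0
    # Sum of (threshold/2)^i / i! for i = 0 .. dof/2 - 1.
    for i in range(dof // 2):
        if i > 0:
            term *= half / i
        total += term
    return math.exp(-half) * total


def chi2_monitor(innovations, variances, window, threshold):
    """Return a list of (k, statistic, alarm) once the window is full.

    Scalar case of (38): the statistic is the sum over the last `window`
    samples of innovation squared divided by its variance.
    """
    buf = deque(maxlen=window)
    out = []
    for k, (g, v) in enumerate(zip(innovations, variances)):
        buf.append(g * g / v)
        if len(buf) == window:
            stat = sum(buf)
            out.append((k, stat, stat > threshold))
    return out
```

For a given window length, `chi2_tail_even(threshold, window)` gives the
false alarm probability of the rule, so the threshold can be tuned by
evaluating it for a few candidate values.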

The implementation of the chi-squared test (38), (39) is quite simple,
but, as one might expect, one pays for this simplicity with rather severe
limitations on performance. As described in [15], [16], this method was
applied to the same inertial calibration and alignment problem to which the
Buxbaum-Haddad multiple hypothesis approach [14]-[16], described in Section
V, was applied. The performance of the chi-squared test was mixed. The
method is basically an alarm method (i.e. the system (38), (39) makes no
attempt to isolate failures), and one finds that those failure modes that
have dramatic effects on $\gamma$ are detectable by this method; however, more
subtle failures are more difficult to detect with this simple scheme.

Comparing the performance of the multiple hypothesis and chi-squared
systems, we see that in some cases we can obtain superior alarm capabilities
if we simultaneously attempt to do failure isolation and estimation. One
can obtain some failure isolation information by considering the components
of $\gamma$ separately (this may be especially useful for sensor failures), and we
refer the reader to [15], [16] for a detailed discussion of this and other
aspects of the chi-squared method.

Another innovations-based approach, developed by Merrill [27], is
motivated by a desire to suppress bad sensor data. Merrill devises a
modification of the least squares criterion in order to suppress extremely
large residuals (which are given a very large weighting in the usual least
squares framework), and he applies his methodology to a power system
application.

A final technique in this category has been studied by several
researchers (Willsky and Jones [28], [29], McAulay and Denlinger [30],
Deyst and Deckert [31], Sanyal and Shen [32], and Chow, Chinn and Willsky
[38]), and we will describe the most general formulation of the approach,
developed in [28], [29]. This technique, which we call the generalized
likelihood ratio (GLR) approach, was in part motivated by the shortcomings
of the simpler chi-squared procedure. The GLR approach, which can be
applied to a wide range of actuator and sensor failures, makes an attempt
to isolate different failures by using knowledge of the different effects
such failures have on the system innovations. The method provides an
optimum decision rule for failure detection and provides useful failure
identification information for use in system reorganization subsequent to
the detection of a failure. In addition, one can devise a number of
simplifications of the technique and can study analytically the tradeoff
between GLR complexity and GLR performance.

Consider the basic dynamical model (1)-(3). The following are four
possible modifications of these equations that incorporate certain sudden
system changes (see Willsky and Jones [28], [29] and Gustafson, Willsky, and
Wang [36] for physical motivation for these and other failure modes of the
same general type):

Dynamics Jump

$x(k+1) = \Phi(k)\,x(k) + B(k)\,u(k) + w(k) + \nu\,\delta_{k+1,\theta}$    (40)

Here $\nu$ is an unknown n-vector, $\theta$ is the unknown time of failure, and
$\delta_{ij}$ is the Kronecker delta. Such a model can be used to model sudden
shifts in bias states (as in the inertial problem studied in [15], [16]).

Dynamic Step

$x(k+1) = \Phi(k)\,x(k) + B(k)\,u(k) + w(k) + \nu\,\sigma_{k+1,\theta}$    (41)

Here $\sigma_{ij}$ is the unit step

$\sigma_{ij} = \begin{cases} 1, & i \ge j \\ 0, & i < j \end{cases}$    (42)

This model can be used to model certain actuator failures (compare to the
Beard-Jones example in Section III; see equation (15)).

Sensor Jump

$z(k) = H\,x(k) + J\,u(k) + v(k) + \nu\,\delta_{k,\theta}$    (43)

We can use this to model bad data points.

Sensor Step

$z(k) = H\,x(k) + J\,u(k) + v(k) + \nu\,\sigma_{k,\theta}$    (44)

Sudden changes in sensor biases fit into this model.
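
As a small illustration of how the four failure terms enter the model, the
following sketch simulates scalar versions of (40)-(44). It is a sketch under
simplifying assumptions that are not part of the original formulation: the
system is scalar and time-invariant, the control input is omitted (B = J = 0),
and the names `simulate`, `phi`, `h`, `q` and `r` are hypothetical.

```python
import random

MODES = ("dynamics_jump", "dynamics_step", "sensor_jump", "sensor_step")


def delta(i, j):
    """Kronecker delta."""
    return 1.0 if i == j else 0.0


def sigma(i, j):
    """Unit step: 1 for i >= j, 0 otherwise."""
    return 1.0 if i >= j else 0.0


def simulate(phi, h, nu, theta, mode, n_steps, q=0.0, r=0.0, seed=0):
    """Return measurements z(0), ..., z(n_steps - 1) for one failure mode.

    q and r are the standard deviations of the process and sensor noise;
    nu is the failure magnitude and theta the failure time.
    """
    if mode not in MODES:
        raise ValueError("unknown failure mode: %r" % (mode,))
    rng = random.Random(seed)
    x, zs = 0.0, []
    for k in range(n_steps):
        # Measurement equation, with the sensor failure terms of (43), (44).
        z = h * x + rng.gauss(0.0, r)
        if mode == "sensor_jump":
            z += nu * delta(k, theta)
        elif mode == "sensor_step":
            z += nu * sigma(k, theta)
        zs.append(z)
        # State equation, with the dynamics failure terms of (40), (41).
        x = phi * x + rng.gauss(0.0, q)
        if mode == "dynamics_jump":
            x += nu * delta(k + 1, theta)
        elif mode == "dynamics_step":
            x += nu * sigma(k + 1, theta)
    return zs
```

With the noise switched off (q = r = 0) the output shows the bare effect of
each failure: a single bad point, a persistent offset, a decaying or
persistent state shift, or a growing drift, depending on the mode and on phi.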

By the linearity of the system (1)-(3) and the filter (4)-(10), one
can determine the effect of each of the failure modes on the innovations.
The general form is

$\gamma(k) = G(k;\theta)\,\nu + \tilde{\gamma}(k)$    (45)

where $\tilde{\gamma}(k)$ is the filter innovations if no failure occurs, and the
matrix G can be precomputed (see [29], [38]). This matrix, which is different
for each of the four cases (40)-(44), is called the failure signature matrix
and provides us with an explicit description of how various failures propagate
through the system and filter.
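
The linearity argument can be illustrated with a scalar example. The sketch
below propagates a unit-size failure through a scalar system and a filter to
obtain the signature sequence. The filter form is an assumption made here for
illustration, since (4)-(10) are not reproduced in this section: a predictor
with constant gain, xhat(k+1|k) = phi [xhat(k|k-1) + gain * innovation(k)],
with innovation(k) = z(k) - h xhat(k|k-1). The function and parameter names
are hypothetical.

```python
def signature(phi, h, gain, theta, mode, k_max):
    """Return [G(0;theta), ..., G(k_max;theta)] for a scalar model.

    By linearity, the effect of a failure of size nu on the innovations is
    nu times the response to a unit failure with all noises set to zero.
    """
    modes = ("dynamics_jump", "dynamics_step", "sensor_jump", "sensor_step")
    if mode not in modes:
        raise ValueError("unknown failure mode: %r" % (mode,))
    x_eff = 0.0     # effect of the unit failure on the true state x(k)
    xhat_eff = 0.0  # effect on the one-step prediction xhat(k|k-1)
    sig = []
    for k in range(k_max + 1):
        # Direct effect of a sensor failure on the measurement z(k).
        b = 0.0
        if mode == "sensor_jump" and k == theta:
            b = 1.0
        elif mode == "sensor_step" and k >= theta:
            b = 1.0
        g = h * (x_eff - xhat_eff) + b
        sig.append(g)
        # The filter absorbs part of the failure through its own update.
        xhat_eff = phi * (xhat_eff + gain * g)
        # Direct effect of a dynamics failure on the next state x(k+1).
        a = 0.0
        if mode == "dynamics_jump" and k + 1 == theta:
            a = 1.0
        elif mode == "dynamics_step" and k + 1 >= theta:
            a = 1.0
        x_eff = phi * x_eff + a
    return sig
```

For a sensor step the sequence starts at 1 at the failure time and then
settles to a smaller steady-state value as the filter partially tracks the
bias, so the signature is time-varying even for a time-invariant system and
filter.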

The full-blown GLR method involves the following: we assume we are
looking for one of the four classes of failures and have computed the
appropriate signature matrix. Given the residuals, we compute the maximum
likelihood estimates of $\nu$ and $\theta$, and, assuming that these estimates are
correct, we compute the log-likelihood ratio for failure versus no failure
(see Van Trees [41] for a general discussion of GLR methods). The
implementation of the full GLR requires a linearly growing bank of matched
filters, computing the best estimates of $\nu$ assuming a particular value of
$\theta \in \{1,\ldots,k\}$.
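
The computation just described can be sketched for the scalar case as follows.
This is a minimal illustration, not the recursive matched-filter
implementation of [28], [29]: it recomputes the sums for every hypothesized
failure time, assumes Gaussian innovations with known variances, assumes the
signature is zero before the failure, and uses zero-based time indices. The
callable `signature_fn(j, theta)` and the other names are hypothetical.

```python
def glr(residuals, variances, signature_fn):
    """Return (statistic, theta_hat, nu_hat) at time k = len(residuals) - 1.

    For each hypothesized failure time theta, with g(j) the signature:
      d = sum_j g(j) * residual(j) / V(j)   (matched filter output)
      c = sum_j g(j)^2 / V(j)               (information about nu)
    The maximum likelihood estimate of nu is d / c, and d^2 / c is twice
    the log-likelihood ratio maximized over nu. The GLR decision compares
    the largest d^2 / c over theta with a threshold.
    """
    k = len(residuals) - 1
    best = (0.0, None, 0.0)
    for theta in range(k + 1):
        d = 0.0
        c = 0.0
        for j in range(theta, k + 1):
            g = signature_fn(j, theta)
            d += g * residuals[j] / variances[j]
            c += g * g / variances[j]
        if c > 0.0:
            stat = d * d / c
            if stat > best[0]:
                best = (stat, theta, d / c)
    return best
```

The loop over theta is the bank of matched filters: one pair (d, c) per
hypothesized failure time, so the work grows with the length of the record.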

A number of remarks can be made concerning the GLR system. We note that,
as with other methods such as Buxbaum-Haddad or Chien, the inclusion of the
variable $\theta$ to indicate our uncertainty as to the time of failure keeps the
detection system sensitive to new data. However, it is the estimation of $\theta$
that causes the growing complexity problem. On the other hand, even if the
full GLR is not implementable, it can serve as a benchmark for other schemes
and can in fact be used as a starting point for the design of simpler systems.
One simplification that eliminates the growing complexity is the restriction
of the estimate of $\theta$ to a window

$k-N \le \theta \le k-M$    (45)

where the lower bound is included to limit complexity, and the upper bound
is set by failure observability and false alarm considerations. Successful
simulation runs with N=M (i.e., when we don't optimize $\theta$ at all and have
only one matched filter for $\nu$) are reported by Willsky and Jones in [29].
We remark only that the price one pays for "windowing" the estimate of $\theta$ is
a reduction in the accuracy of the estimate of $\nu$. For example, in the
case of N=M, we often are able to detect failures extremely quickly, but if
$\hat{\theta} = k-N$ is not the correct time of failure, the estimate of $\nu$ may be
severely degraded (e.g., our estimate of the slope of a ramp changes as we
change our estimate of the time at which it started). We note that the
estimation of $\theta$ is similar to time-of-arrival estimation problems that arise
in various applications, and refer the reader to Van Trees [44] for a general
discussion of several techniques.
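
The effect of windowing can be seen in a small scalar sketch. As before, this
is an illustration under assumptions introduced here (scalar Gaussian
innovations, zero-based indices, a hypothetical callable
`signature_fn(j, theta)`), not the implementation of [29].

```python
def glr_windowed(residuals, variances, signature_fn, n_back, m_back):
    """GLR with the failure time restricted to k - n_back <= theta <= k - m_back.

    Returns (statistic, theta_hat, nu_hat) at time k = len(residuals) - 1.
    With n_back == m_back a single failure time is hypothesized and only
    one matched filter is needed.
    """
    if not 0 <= m_back <= n_back:
        raise ValueError("need 0 <= m_back <= n_back")
    k = len(residuals) - 1
    lo = max(0, k - n_back)
    hi = k - m_back
    best = (0.0, None, 0.0)
    for theta in range(lo, hi + 1):
        d = 0.0
        c = 0.0
        for j in range(theta, k + 1):
            g = signature_fn(j, theta)
            d += g * residuals[j] / variances[j]
            c += g * g / variances[j]
        if c > 0.0:
            stat = d * d / c
            if stat > best[0]:
                best = (stat, theta, d / c)
    return best
```

When the single hypothesized time in the N=M case is earlier than the true
failure time, samples taken before the failure are averaged into the estimate
of the failure size, which pulls it toward zero; this is the degradation
described above.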

Also, we note that even if the physical system and filter are
time-invariant, the GLR monitoring system is time-varying, as the failure
signature G includes transient effects. In some cases one may be able to
neglect these and utilize a simpler steady state signature. This is quite
similar to Chien's use [24] of the steady-state effect of the failure on the
residuals, and the criticisms of that approach, given in Section VI, apply
here as well.

One can also simplify the implementation by either partially or
completely specifying the failure magnitude $\nu$. Constrained GLR (CGLR) is
based on the assumption that

$\nu = \alpha f_i$    (46)

where $\alpha$ is an unknown scalar and $f_i$ is one of r possible failure directions.
This technique is described in [29]. If we completely specify $\nu$

$\nu = \nu_0$    (47)

we obtain the simplified GLR (SGLR) algorithm which is extremely simple to
implement, as we have completely eliminated the need for the matched filters
to estimate $\nu$. The use of specified failure sizes is similar to that
proposed by Chien [24], although in SGLR one can use the time-varying failure
signature, which should aid in failure detection. As initial results for
the detection of electrocardiogram arrhythmias indicate (see Gustafson,
et al., [36]), the estimation of $\nu$ is not nearly as important for detection as
the matching of failure signatures. Also, by the use of several values of
$\nu_0$ (i.e. by implementing several parallel SGLR's), one can achieve a high
level of failure isolation without a great deal of additional software
complexity. In addition, one could consider a "dual-mode" procedure in which
SGLR is used for alarm and isolation, with full GLR used only afterward in
order to estimate the magnitude of the failure.
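
A scalar sketch of the SGLR idea follows. With the failure size fixed at
`nu0`, twice the log-likelihood ratio for a hypothesized failure time is
2 nu0 d - nu0^2 c, where d and c are the weighted sums of signature times
residual and of signature squared, so no estimate of the failure size is
formed. The assumptions are those of an illustration only (scalar Gaussian
innovations, zero-based indices), and the names `sglr` and `signature_fn`
are hypothetical.

```python
def sglr(residuals, variances, signature_fn, nu0):
    """Return (statistic, theta_hat) for a fixed failure size nu0.

    The statistic is twice the log-likelihood ratio of "failure of size nu0
    at time theta" against "no failure", maximized over theta. It is
    compared with a threshold to decide whether to signal a failure.
    """
    k = len(residuals) - 1
    best_stat, best_theta = float("-inf"), None
    for theta in range(k + 1):
        d = 0.0
        c = 0.0
        for j in range(theta, k + 1):
            g = signature_fn(j, theta)
            d += g * residuals[j] / variances[j]
            c += g * g / variances[j]
        stat = 2.0 * nu0 * d - nu0 * nu0 * c
        if stat > best_stat:
            best_stat, best_theta = stat, theta
    return best_stat, best_theta
```

Running several copies with different values of `nu0` (for example both
signs) and comparing their statistics gives a simple form of the parallel
SGLR arrangement mentioned above.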

The various simplifications of GLR, as well as full GLR, are amenable
to certain analysis, such as the calculation of $P_F$, $P_D$ and (at least for
SGLR) the expected time delay in detection. By performing such analyses,
one can study in detail the tradeoff between complexity and performance. A
methodology for such comparisons is presently being developed and is being
applied to an aircraft failure detection problem. Initial results are
reported by Chow, et al., in [38], and a description of a detailed methodology
will be reported in the near future (see Bueno, Chow, Gershwin, and
Willsky [43]). In addition to the calculation of $P_F$ and $P_D$, the comparison
methodology reported in [43] includes the computation of cross-detection
probabilities, i.e. the probability of detecting a failure of type A when
a failure of type B has occurred. Such information can be useful in
designing failure isolation procedures and also in determining if failure
detector A can be successfully utilized as an alarm for failures of type B.
This can lead to substantial simplifications in a failure alarm system.
Also, we refer the reader to [29], [36], and [38] for successful simulations
of the GLR method.

Presently the GLR method is being extended to other failure modes, such
as:

Hard-Over Actuator Failure

$x(k+1) = \Phi(k)\,x(k) + [B(k) + M\,\sigma_{k+1,\theta}]\,u(k) + w(k)$    (48)

With this model we can take into account complete (or "off") failures of
certain actuators. For example, an off failure of the ith actuator can be
modeled by choosing M all zero except for the ith column, which is taken to
be the negative of the ith column of B. The GLR detector for (48) is presently
under development [38], [43], and we note that this model is more difficult
than the others as the effect of the failure is modulated by the input values
u(k).

Increased Process Noise Failures

$x(k+1) = \Phi(k)\,x(k) + B(k)\,u(k) + w(k) + \xi(k)\,\sigma_{k+1,\theta}$    (49)

Here $\xi$ is additional white process noise.

Hard-Over Sensor Failures

$z(k) = H\,x(k) + J\,u(k) + v(k) + [M\,x(k) + S\,u(k)]\,\sigma_{k,\theta}$    (50)

Here the failures are modulated by u and x, and a failure of the ith sensor
is modeled by choosing the ith rows of M and S appropriately.

Added Sensor Noise Failures

$z(k) = H\,x(k) + J\,u(k) + v(k) + \xi(k)\,\sigma_{k,\theta}$    (51)

The analysis of these failure modes is presently being performed [38], [43],
and it is anticipated that SGLR algorithms will also be developed.

In addition to these failure modes, one can develop additional models
along these lines for particular applications. In particular, we have
developed several additional models similar to those described by equations
(40)-(44) for our work on the detection and classification of arrhythmias in
electrocardiograms. The results reported in [36] are rather striking, as
in all the tests performed we observed no false alarms, detected all rhythm
changes immediately (with no incorrect estimates of $\theta$), and classified all
rhythm changes correctly. These tests utilized the full GLR approach and
have provided useful insight into the characteristics of the method. For
example, the use of maximum likelihood estimates of $\nu$ and $\theta$ precludes the
use of a priori statistics on these variables. In the ECG problem, one is
quite interested in accurate estimates of $\nu$, and one also can come up with
reasonable a priori statistics on $\nu$ based on physical arguments. Thus, it
may pay to incorporate such a priori statistics into the GLR system, and this
can be done rather easily by proper initialization of the matched filters
estimating $\nu$. On the other hand, for the ECG problem one does not want to
look for abrupt changes at one point in the record more than at another,
and thus it does not make sense to include a priori statistics on $\theta$. In
fact, one can argue that inclusion of a priori failure information tends
to discount the observed data in order to avoid false alarms (unless failures
are extremely likely), and one should probably avoid the inclusion of such
information unless one is especially worried about false alarms. However,
if one wishes to use such data, one can utilize the interpretation of the
likelihood ratios as ratios of conditional probabilities of failure times in
order to determine the appropriate modification of GLR [29].

Finally, we note that the GLR system provides extremely useful
information for system compensation subsequent to the detection of a failure.
For example, one can utilize the GLR-produced estimates of $\nu$ and $\theta$ to
determine an optimal update procedure for the filter estimate and covariance
[29]. Once this update has been performed, the GLR system can be used to detect
further failures, thus allowing the detection of multiple events. We refer
the reader to [29], [38] for further discussions of the use of GLR-produced
information in the design of failure compensation systems.

VIII. Conclusions 

In this paper we have discussed a number of the issues involved in the
design of failure detection systems. We have also reviewed a variety of
existing failure detection methods and have discussed their characteristics
and design tradeoffs. The failure detection problem is an extremely
complex one, and the choice of an appropriate design depends heavily on the
particular application. Issues such as available computational facilities
and level of hardware redundancy enter in a crucial way in the design
decision. For example, as we have mentioned, the use of a sophisticated failure
detection-compensation system may allow one to reduce the level of hardware
redundancy without much of a loss in overall system reliability.

The development of failure detection methods is still a relatively
new subject. At this time most of the work has been at a theoretical level
with only a few real applications of techniques [6]-[9], [13], [31], [36].
Much work is yet to be done in the development of implementable systems
complete with a variety of design tradeoffs. Work is needed in the
development of efficient techniques for failure compensation and system
reorganization. In addition, there is a great need for the analysis of the
robustness of various failure detection systems in the presence of variations
in system parameters and in the presence of modeling errors and system
nonlinearities. For example, it is conjectured that SGLR is less sensitive to
parameter errors than the more complex full GLR; however, at present there are
no analytical results or simulations to support this conjecture. These and
other issues, such as the incorporation of fault-tolerant computer concepts
into an overall reliable design methodology (see Deyst [40]), await
investigation in the future.